Take-home Exercise 1

Creating data visualisation beyond default

Clarence Tay https://www.linkedin.com/in/clarencetay/ (Singapore Management University - MITB)https://scis.smu.edu.sg/master-it-business
April 24, 2022

Overview

In this take-home exercise, I will be exploring and revealing the demographic of the city of Engagement, Ohio USA by using appropriate static statistical graphics methods. The data will be processed by using appropriate tidyverse family of packages and the statistical graphics will be done with ggplot2 and its extensions. Datasets used will be taken from the VAST Challenge 2022.

Getting Started

Before I get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, I will install the R packages and load them onto R environment.

The chunk code below will do the trick.

packages = c('tidyverse','psych','plotly')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p, character.only = T)
  }

Importing Data

The code chunk below imports Participants.csv from the data folder, into R by using read_csv() of readr and save it as an tibble dataframe called participants_data.

participants_data <- read_csv("data/Participants.csv")


After importing the Participants.csv, I used the function glimpse() of dplyr, like its name suggests, to get a glimpse of the data that I am are working on.

glimpse(participants_data)
Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

From the output, we know that the dataset consists of 1011 unique participants and has 7 different columns of metadata related to the participants.

It seems like the data in column ‘participantId’ are just unique tags given to each participant in this study, hence is likely not useful in this preliminary analysis.

The data in columns ‘age’ and ‘joviality’ appears to be continuous data type, while data in column ‘householdSize’ is likely to be a discrete data type.

The data in columns ‘haveKids’, ‘educationLevel’ and ‘interestGroup’ appears to be categorical data type. With the above initial observations, I would be able to see what are the unique values and count in these categorical and discrete data type columns.

table(participants_data$householdSize)

  1   2   3 
337 373 301 

For ‘householdSize’, there are 3 unique values of 1, 2 and 3. Also, the count of these unique values appears to be quite even split out.

table(participants_data$haveKids)

FALSE  TRUE 
  710   301 

For ‘haveKids’, it consists of either True or False (boolean), with around 70% of the participants not having kids.

table(participants_data$educationLevel)

          Bachelors            Graduate HighSchoolOrCollege 
                232                 170                 525 
                Low 
                 84 

For ‘educationLevel’, there are 4 unique types, namely Bachelors, Graduate, High School/College and Low. A quick glance tells us that bulk of the participants (~50%) are only having high school/college qualifications. Lowly educated participants are minority in this case.

table(participants_data$interestGroup)

  A   B   C   D   E   F   G   H   I   J 
102  91 102  96  83 106 108 111  96 116 

For ‘interestGroup’, there are 10 unique types, named from A to J (actual interest group names are redacted). A high level glance at the numbers tells us that the number of participants in the various interest groups are quite evenly split too.

For curiosity, the describe() of psych package, was used to get a brief statisitical understanding of the dataset too.

des <- describe(participants_data, fast = TRUE)
print(des, digits=5)
               vars    n      mean        sd     min        max
participantId     1 1011 505.00000 291.99486  0.0000 1010.00000
householdSize     2 1011   1.96439   0.79399  1.0000    3.00000
haveKids          3 1011       NaN        NA     Inf       -Inf
age               4 1011  39.07418  12.37930 18.0000   60.00000
educationLevel    5 1011       NaN        NA     Inf       -Inf
interestGroup     6 1011       NaN        NA     Inf       -Inf
joviality         7 1011   0.49379   0.29135  0.0002    0.99923
                    range      se
participantId  1010.00000 9.18332
householdSize     2.00000 0.02497
haveKids             -Inf      NA
age              42.00000 0.38933
educationLevel       -Inf      NA
interestGroup        -Inf      NA
joviality         0.99903 0.00916

From the output, the observations are,

Time to plot some exciting charts!!!

Histograms

The code chunk below plots a histogram by using geom_histogram() of ggplot2.

ggplot(data = participants_data,
       aes(x=age)) +
  geom_histogram(bins=20, fill = 'light blue', color='black') +
  ggtitle("Histogram of Participants' age") +
  xlab('Age') +
  ylab('Count') +
  scale_x_continuous(breaks = seq(10, 70, by = 2)) + 
  scale_y_continuous(breaks = seq(0, 100, by = 5))

From the above chart, we are able to see the distribution of the participants across the various age bins.

From this simple histogram, we can see that participants of age 18-19 are the minority, while participants of age 30-31 have the largest proportion within this dataset population. Other notable age groups with higher proportion are of age 42-43 and 52-53.


Understanding that ggplot also allows us to fill in the chart with another additional parameter, I’ve added the ‘fill’ parameter into the code chunk that would embed the ‘educationLevel’ data into the chart too, as seen below.

ggplot(data = participants_data,
       aes(x=age, fill= educationLevel)) +
  geom_histogram(bins=20, color='black') +
  ggtitle("Histogram of Participants' age, filled by Education Level") +
  xlab('Age') +
  ylab('Count') +
  labs(fill="Education Level") + 
  scale_x_continuous(breaks = seq(10, 70, by = 2)) + 
  scale_y_continuous(breaks = seq(0, 100, by = 5))

Now we can see the distribution of the various academic qualifications of the participants across the age histogram. Visually, it is obvious that within the age groups with larger proportion (eg. age 30-31 and 52-53) also have a larger proportion of them with high school/college qualifications.

With respect to the larger age 30-31 group, assuming that the dataset is dated for 2022, it would mean that these participants were born in around the year 1991-1992. Coinciding with this time period was the early 1990s economic recession era.

Incidentally, if we look at the peak groups (age 41-42, 52-53), the year of birth from these participants are 1980-1981 and 1969-1970, which also coincide with the economic recession period back then.

Assuming my assumption is correct, throughout, we can also see that after these spikes in births, the subsequent years saw a relatively sharp decline in numbers before creeping up gradually.

Usually, birth rates declined after economic crisis happens. However, more cross-referencing data is required to reveal more and confirm on interesting observations.


A additional tweak applied was to use plotly package to create interactive charts, as seen below.

ggplotly(ggplot(data = participants_data,
       aes(x=age, fill= educationLevel)) +
  geom_histogram(bins=20, color='black') +
  ggtitle("Histogram of Participants' age, filled by Education Level") +
  xlab('Age') +
  ylab('Count') +
  labs(fill="Education Level") + 
  scale_x_continuous(breaks = seq(10, 70, by = 2)) + 
  scale_y_continuous(breaks = seq(0, 100, by = 10)))

With this, we are now able to recieve more microdata (eg. count and average) regarding the different sub-groups in the chart as we hover our cursor over them.


While the previous chart combines everything into 1 chart, supposedly for a 1-stop chart solution, the distribution and count of the participants may not be that obvious.

Hence the next chart includes facet_wrap() into the code chunk, which splits out the output into separate mini-charts grouped by the Education Level data.

ggplotly(ggplot(data = participants_data,
       aes(x=age, fill= educationLevel)) +
  geom_histogram(bins=20, color='black') +
  ggtitle("Histogram of Participants' age, filled by Education Level") +
  facet_wrap(~educationLevel) +
  xlab('Age') +
  ylab('Count') +
  labs(fill="Education Level") + 
  scale_x_continuous(breaks = seq(10, 70, by = 5)) + 
  scale_y_continuous(breaks = seq(0, 100, by = 10)))

Now, it will be clearer to see the distribution of the education qualification of the participants across various ages.


Now that we have established the rough idea of how I am going to plot the charts, let’s move on to explore the demographics with a slight change in the analysed parameter.

Here we will be replacing the ‘age’ column data in x, with ‘joviality’ column data.

ggplotly(ggplot(data = participants_data,
       aes(x=joviality, fill= educationLevel)) +
  geom_histogram(bins=10, color='black') +
  ggtitle("Histogram of Participants' joviality, filled by Education Level") +
  xlab('Joviality') +
  ylab('Count') +
  labs(fill="Education Level") + 
  scale_x_continuous(breaks = seq(0, 1, by = 0.1)) + 
  scale_y_continuous(breaks = seq(0, 140, by = 10)))

Based on the chart, it appears that there is a good spread of participants (with varying academic qualifications) across, except for the two extreme ends (left being the very unhappy group, right being the seriously happy group). Proportion wise, it seems like education status does not have much impact on the happiness level.


The same chart is plotted with plotly package.

ggplotly(ggplot(data = participants_data,
       aes(x=joviality, fill= educationLevel)) +
  geom_histogram(bins=20, color='black') +
  ggtitle("Histogram of Participants' joviality, filled by Education Level") +
  facet_wrap(~educationLevel) +
  xlab('Joviality') +
  ylab('Count') +
  labs(fill="Education Level") + 
  scale_x_continuous(breaks = seq(0, 1, by = 0.1)) + 
  scale_y_continuous(breaks = seq(0, 140, by = 10)))


Next up, a chart was plotted to see the distribution of participants of various ages with their respective proportion of whether they have kids.

ggplotly(ggplot(data = participants_data,
       aes(x=age, fill= haveKids)) +
  geom_histogram(bins=20, color='black')+
  ggtitle("Histogram of Participants' age, filled by whether they have kids") +
  xlab('Age') +
  ylab('Count') +
  labs(fill="Have Kids") + 
  scale_x_continuous(breaks = seq(10, 70, by = 2)) + 
  scale_y_continuous(breaks = seq(0, 140, by = 10)))

Coincidentally, the larger groups for age 30-31, 41-42 and 52-53, seems to have a larger proportion of them not having kids. If we were to make a bold assumption that birth bearing age at ~25 to 30 years old, it seems like these group of participants were also experiencing a economic recession during their prime child bearing age.

Example, for the age group of 52-53, it was the year of 2000 when another economic recession happened when these participants were around 30 years old.


Scatter Plot

A scatter plot was charted with participants’ age and joviality. It seems like there is no correlation at all between the participants’ age and their happiness level.

ggplot(data = participants_data,
         aes(x=age, y= joviality)) +
  geom_point() +
  geom_smooth(size=0.5) +
  ggtitle("Scatter Plot of  Participants' Joviality vs Age") +
  xlab('Age') +
  ylab('Joviality') +
  scale_x_continuous(breaks = seq(10, 70, by = 2)) + 
  scale_y_continuous(breaks = seq(0, 1, by = 0.1))


Boxplot

Next up, a boxplot was plotted with geom_boxplot() and geom_point(), with facet_wrap() as below.

ggplot(data = participants_data,
         aes(y=age, x = interestGroup)) +
  geom_boxplot() +
  geom_point(stat='summary',
             fun.y='mean',
             colour='red',
             size=2) +
  facet_wrap(educationLevel~.) +
  ggtitle("Box Plot of Participants' Age across various Interest Group, grouped by Education Level") +
  xlab('Interest Group') +
  ylab('Age')

Some key observations are,

While the actual interest group type/names are redacted, this insight does tell us how different age group of various education status has a impact on the kind of interest groups they are likely to be in.


Following, we will look at the same boxplots but with respect to Joviality instead of Age.

ggplot(data = participants_data,
         aes(y=joviality, x = interestGroup)) +
  geom_boxplot() +
  geom_point(stat='summary',
             fun.y='mean',
             colour='red',
             size=2) +
  facet_wrap(educationLevel~.) +
  ggtitle("Box Plot of Participants' Joviality across Interest Group, grouped by Education") +
  xlab('Interest Group') +
  ylab('Joviality')

Some key observations are,


In our next boxplot, I’ve added another dimension (whether the particpants have kids or not) to see if there is any more interesting observations.

ggplot(data = participants_data,
         aes(y=age, x = haveKids)) +
  geom_boxplot() +
  ggtitle("Box Plot of Participants' age across various Interest Group, grouped by Education") +
  xlab('Have Kids?') +
  ylab('Age') +
  facet_grid(interestGroup~educationLevel) +
  coord_flip()

Some key observations are,

While it may not be intuitive now, but I believe with deeper analysis, we may be able to connect the dots and uncover some insights.

With that, this is the end of my take-home exercise 1. =)